MULTEXT-East Resources for Serbian
Abstract
The paper presents the MULTEXT-East language resources for the Serbian language. MULTEXT-East is a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications, defining the features that describe word-level syntactic annotations; medium-scale morphosyntactic lexica; and annotated parallel, comparable, and speech corpora. The most important component is the linguistically annotated corpus consisting of Orwell’s novel “1984” in the English original and translations. MULTEXT-East has already seen several editions, the latest being Version 3, whose most important addition is the Serbian language resources. The paper presents MULTEXT-East Version 3 with special emphasis on the Serbian components, namely the structurally annotated “1984”, the morphosyntactic specifications, the morphosyntactic lexicon and the linguistically annotated “1984”. The complete dataset, unique in terms of languages and the wealth of encoding, is extensively documented, and freely available for
Similar resources
MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
The paper presents the fourth, “Mondilex” edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specif...
MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
The paper presents the third edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications, defining the features that describe word-level syntactic annotation...
Orwell’s 1984 – the Case of Serbian Revisited
In this paper we present an alternative version of the morphosyntactically annotated Serbian translation of 1984. This version follows the basic principles of MULTEXT-East version, except for one addition – the text will be annotated with multi-word units as well. We will present the resources used for annotation with multi-word units and explain how these resources were enriched with multi-wor...
The MULTEXT-East Morphosyntactic Specifications for Slavic Languages
Word-level morphosyntactic descriptions, such as “Ncmsn” designating a common masculine singular noun in the nominative, have been developed for all Slavic languages, yet there have been few attempts to arrive at a proposal that would be harmonised across the languages. Standardisation adds to the interchange potential of the resources, making it easier to develop multilingual applications or t...
The MULTEXT-East Morphosyntactic Specification for Slavic Languages
Word-level morphosyntactic descriptions, such as “Ncmsn” designating a common masculine singular noun in the nominative, have been developed for all Slavic languages, yet there have been few attempts to arrive at a proposal that would be harmonised across the languages. Standardisation adds to the interchange potential of the resources, making it easier to develop multilingual applications or t...
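The abstracts above quote “Ncmsn” as the description of a common masculine singular noun in the nominative. As a rough illustration of how such a positional code can be read, here is a minimal Python sketch. The function name `decode_noun_msd` and the lookup table `NOUN_ATTRIBUTES` are hypothetical, the table holds only a small hand-picked subset of values, and treating a hyphen as "attribute not applicable" is an assumption here; the published morphosyntactic specifications remain the authority on attribute order and value codes for each language.

```python
# Hypothetical, partial lookup table for noun descriptions: one entry per
# character position after the leading category letter "N".
# This is an illustrative subset, not the full specification.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative"}),
]


def decode_noun_msd(msd: str) -> dict:
    """Expand a noun description such as "Ncmsn" into named features."""
    if not msd or msd[0] != "N":
        raise ValueError(f"not a noun description: {msd!r}")
    features = {"Category": "Noun"}
    # Pair each remaining character with the attribute at that position.
    for (name, values), code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code == "-":
            continue  # assumed placeholder for a non-applicable attribute
        features[name] = values.get(code, f"unknown({code})")
    return features


if __name__ == "__main__":
    print(decode_noun_msd("Ncmsn"))
```

Keeping the attribute table as data rather than code means a fuller table for another category or language could be swapped in without touching the decoding loop.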